Chapter 4

Questions 1, 2 and 3

Dataset description:

The Boston dataset examines relationships between the standard of living, housing quality, and other criteria such as the urban environment, accessibility, and demographics within the Boston area.

Each row represents data for a specific town and has information about various factors such as crime rate, proportion of non-retail business acres, property tax rates, pupil-teacher ratios, etc.

This data frame contains the following columns:

  • crim - per capita crime rate by town.

  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus - proportion of non-retail business acres per town.

  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox - nitrogen oxides concentration (parts per 10 million).

  • rm - average number of rooms per dwelling.

  • age - proportion of owner-occupied units built prior to 1940.

  • dis - weighted mean of distances to five Boston employment centres.

  • rad - index of accessibility to radial highways.

  • tax - full-value property-tax rate per $10,000.

  • ptratio - pupil-teacher ratio by town.

  • black - 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town.

  • lstat - lower status of the population (percent).

  • medv - median value of owner-occupied homes in $1000s.

Data analysis:

Explore the structure and the dimensions of the dataset. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, the distributions of the variables and the relationships between them.

library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyr)

setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

# load the Boston dataset from the MASS package into the workspace
Boston <- Boston
# explore the mean, median, min and max of each variable
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# 506 entries with 14 variables
dim(Boston)
## [1] 506  14
# Data type: all num or int
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# let's create a scatterplot matrix of all variables -> hard to read, as there are too many variables
pairs(Boston)

# Let's analyze the histogram for each variable: 
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Let's compute a correlation matrix
matrix_correlation_var <- cor(Boston)

# Visualize correlation matrix as a heatmap
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
ggplot(data = melt(matrix_correlation_var), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0)

According to the correlation matrix, there are negative relationships between certain variables:

lstat and medv - lower status of population and the median price of homes owned by occupants

lstat and rm - lower status of population and the average of room numbers per home

tax and medv - the percentage at which a property is taxed based on its assessed value and the median price of homes owned by occupants

dis and lstat - the weighted average distance from each house to these employment centers and the lower status of population

dis and age - the weighted average distance from each house to these employment centers and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940.

dis and nox - the weighted average distance from each house to these employment centers and nitrogen oxides concentration.

dis and indus - the weighted average distance from each house to these employment centers and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

tax and dis - the percentage at which a property is taxed based on its assessed value and the weighted average distance from each house to these employment centers

zn and age - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940

zn and nox - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the nitrogen oxides concentration.

zn and indus - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

Positive relationships between certain variables:

medv and rm - the median price of homes owned by occupants and the average of room numbers per home

tax and indus - the percentage at which a property is taxed based on its assessed value and the proportion of non-retail business acres per town.

tax and nox - the percentage at which a property is taxed based on its assessed value and the nitrogen oxides concentration.

age and indus - the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the proportion of non-retail business acres per town.

age and nox- the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the nitrogen oxides concentration.

nox and indus - the nitrogen oxides concentration and the proportion of non-retail business acres per town.
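The pairs listed above were read off the heatmap by eye. As a complementary check (my own sketch, not part of the original analysis), the strongest correlations can also be extracted programmatically from the correlation matrix:

```r
library(MASS)  # for the Boston dataset

# compute the correlation matrix and keep each variable pair only once
cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA
pairs_df <- na.omit(as.data.frame(as.table(cm)))
names(pairs_df) <- c("var1", "var2", "corr")

# sort by absolute correlation, strongest first
pairs_df <- pairs_df[order(-abs(pairs_df$corr)), ]
head(pairs_df, 10)  # rad/tax ranks at the top
```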

-> variables to test in my models: lstat, medv, rm, tax, dis, age, nox, indus, zn

Test some models based on the dataset and the observed correlations

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
model1 <-  glm(lstat ~ medv + dis + rm,data=Boston)
model2 <-  glm(medv ~ rm + tax + lstat ,data=Boston)

# all the VIFs are between 1 and 2.1, which is fine. This suggests low multicollinearity and implies that the variance of the estimated coefficients is not significantly inflated by collinearity
vif(model1)
##     medv      dis       rm 
## 1.982257 1.068810 1.940168
vif(model2)
##       rm      tax    lstat 
## 1.610953 1.426005 2.092901
#let's calculate the exponentiated coefficients and their confidence intervals (95%). Note: these are Gaussian GLMs (identity link), so the exponentiated coefficients are not odds ratios in the logistic-regression sense:
OR1 <- coef(model1) %>% exp
OR2 <- coef(model2) %>% exp
CI1 <- confint(model1) %>% exp
## Waiting for profiling to be done...
CI2 <- confint(model2) %>% exp
## Waiting for profiling to be done...
# none of the confidence intervals spans 1, which indicates a statistically significant effect for each predictor in both models
cbind(OR1, CI1) 
##                      OR1        2.5 %       97.5 %
## (Intercept) 1.764803e+16 4.023272e+14 7.741285e+17
## medv        6.606608e-01 6.250484e-01 6.983022e-01
## dis         3.292580e-01 2.756488e-01 3.932934e-01
## rm          1.682751e-01 8.210665e-02 3.448749e-01
cbind(OR2, CI2)
##                     OR2        2.5 %      97.5 %
## (Intercept)   0.6073487  0.001289619 286.0319984
## rm          181.1868455 76.546116788 428.8744403
## tax           0.9935198  0.990167804   0.9968832
## lstat         0.5754723  0.522466602   0.6338557
# the residual deviance is way smaller than the null deviance. It indicates a reasonably good fit of the model to the data.
summary(model1)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.40940    1.92918  19.391  < 2e-16 ***
## medv        -0.41451    0.02827 -14.662  < 2e-16 ***
## dis         -1.11091    0.09067 -12.252  < 2e-16 ***
## rm          -1.78215    0.36612  -4.868 1.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.22406)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  8646.5  on 502  degrees of freedom
## AIC: 2882.2
## 
## Number of Fisher Scoring iterations: 2
# medv and rm also influence each other, so let's modify the model a bit
model11 <-  glm(lstat ~ medv + dis + rm + rm * medv ,data=Boston)
summary(model11)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm + rm * medv, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.41052    3.64466   20.42   <2e-16 ***
## medv        -1.94316    0.13517  -14.38   <2e-16 ***
## dis         -0.89569    0.08285  -10.81   <2e-16 ***
## rm          -7.55541    0.59817  -12.63   <2e-16 ***
## medv:rm      0.22526    0.01957   11.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 13.64918)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  6838.2  on 501  degrees of freedom
## AIC: 2765.5
## 
## Number of Fisher Scoring iterations: 2
# same for this one, residual deviance is way smaller than the null deviance. 
summary(model2)
## 
## Call:
## glm(formula = medv ~ rm + tax + lstat, data = Boston)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.498652   3.140239  -0.159 0.873895    
## rm           5.199529   0.439618  11.827  < 2e-16 ***
## tax         -0.006501   0.001724  -3.770 0.000182 ***
## lstat       -0.552564   0.049302 -11.208  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 29.90866)
## 
##     Null deviance: 42716  on 505  degrees of freedom
## Residual deviance: 15014  on 502  degrees of freedom
## AIC: 3161.4
## 
## Number of Fisher Scoring iterations: 2

Question 4

Standardizing the data. Variables often have different scales and units, which makes direct comparisons challenging. Standardization brings all variables to a common scale, with a mean of 0 and a standard deviation of 1, allowing fair comparisons between them. This aids in interpreting coefficients and comparing the relative importance of different predictors in regression: each variable contributes equally to the analysis, so no single variable dominates because of its scale, and the magnitude of each variable's effect on the outcome becomes easier to compare. Finally, it can mitigate issues related to multicollinearity in regression analysis by putting variables on a similar scale, reducing the impact of differing scales on the regression coefficients.
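Before applying scale() to the whole dataset below, here is a minimal sketch (my own toy example) of what standardization does to a single variable: it subtracts the mean and divides by the standard deviation.

```r
x <- c(2, 4, 6, 8)

# manual z-scores: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same, column-wise
z_scale <- as.vector(scale(x))

all.equal(z_manual, z_scale)  # TRUE
mean(z_scale)                 # 0
sd(z_scale)                   # 1
```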

# select numerical values
Boston_numeric_cols <- Boston[, sapply(Boston, is.numeric)]

# The scale() to standardize and transform the data to have a mean of 0 and a standard deviation of 1.
scaled_boston <- scale(Boston_numeric_cols)

# convert to a data frame
scaled_table_boston <- as.data.frame(scaled_boston)

# how did the data change? The mean of every variable is now 0, so the scaling worked
summary(scaled_table_boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# use the cut() function to create a categorical variable from a numerical one, based on intervals or breaks. We do this for the crim variable, using its quartiles (0, 0.25, 0.50, 0.75, 1) as break points and adding a label for each category.
# include.lowest = TRUE ensures the smallest value is included, so no NA category is created.
quantiles <- quantile(Boston$crim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
interval_labels <- c("premier_quantil", "second_quantil", "third_quantil", "fourth_quantil")  

scaled_table_boston$quantiles_crime <- cut(Boston$crim, quantiles, labels= interval_labels,include.lowest = TRUE)
# Notes: quantiles derived from numeric values can be treated as categorical or continuous. When quantiles define discrete bins that divide a continuous variable into distinct groups, they are treated as categorical (e.g., small, medium, big). When they represent numeric values indicating a position relative to the distribution of a continuous variable (e.g., the 25th percentile, the median, the 75th percentile), they are treated as continuous.
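As a small illustration of the binning described above (a toy example of mine, separate from the Boston data), cut() with quartile breaks and include.lowest = TRUE produces four categories of roughly equal size and no NA values:

```r
v <- 1:100

# quartile break points of v
q <- quantile(v, probs = seq(0, 1, by = 0.25))

# bin v into four labelled categories; include.lowest = TRUE keeps the minimum
cats <- cut(v, breaks = q, labels = c("Q1", "Q2", "Q3", "Q4"),
            include.lowest = TRUE)

table(cats)       # 25 observations in each bin
sum(is.na(cats))  # 0 - no NA category
```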

# drop the former column crim and create a new table
Boston_new <- scaled_table_boston 
# For some reason, this does not knit via the index file when the "select" call below is included. However, knitting works normally from chapter 4 with or without it, so I have to remove it to be able to knit via index.rmd.
# To remove crim, I would do as below.
#%>%select(-crim, everything())

# We need 80% of the total rows for training
train_size <- round(0.8 * nrow(Boston_new))

# Randomly sample 80% of the row indices
train_set <- sample(seq_len(nrow(Boston_new)), size = train_size)

# Create training and testing subsets
train_data <- Boston_new[train_set, ]
test_data <- Boston_new[-train_set, ]
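One caveat about the split above: sample() draws a different random subset on every knit, so the results change each time. A reproducible variant (my own suggestion; the seed value 123 is arbitrary) fixes the random seed first:

```r
set.seed(123)  # any fixed value makes the split reproducible

n <- 506                      # number of rows in the Boston data
train_size <- round(0.8 * n)  # 405 training rows

train_idx <- sample(seq_len(n), size = train_size)
test_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)  # 405
length(test_idx)   # 101
```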

Question 5:

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

Linear Discriminant Analysis (LDA) can be quite useful in various real-life scenarios where classification or dimensionality reduction is required such as: Marketing and Customer Segmentation, product quality control, credit scoring.

It is ideal when we have multiple variables that can help us categorize each entry in one specific group where all others will have similar mean, and variance. The goal is to predict the categorical outcome or class based on a set of predictors.

LDA transforms the original variables into a smaller set of new variables, known as linear discriminants. These new variables are constructed to maximize the separation between the different categories or classes in the data. By using the information captured in the linear discriminants, LDA can accurately assign new observations to their respective classes. In short, LDA combines the original variables into new ones that highlight the differences between groups, and then uses this representation to classify new data. This makes it useful both for simplifying data and for making predictions about categories in that data.

LDA calculates the mean and variance for each predictor within each class or category in the target variable. It finds coefficients for these combinations by considering differences in means between classes and assumes equal variances for each predictor within classes. It assumes that the data approximately follows a normal distribution.
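To make the description above concrete, here is a quick sanity-check sketch on a standard built-in dataset (iris; my own example, separate from the Boston analysis) showing the lda() workflow and the proportion of between-group variance captured by each discriminant:

```r
library(MASS)

# fit LDA with the species as target and all measurements as predictors
fit <- lda(Species ~ ., data = iris)

# proportion of between-group variance explained by each discriminant
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 3)  # LD1 dominates

# classify the training observations and check the accuracy
pred <- predict(fit, iris)$class
mean(pred == iris$Species)  # high on this easy dataset
```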

# check the impact of each variable on the categorical target (quantiles_crime)
# each quantile contains approximately 25% of the observations
# LD1 explains about 96% of the between-group variance, which suggests that LD1 effectively separates the classes based on their mean differences. Along LD1, observations are grouped so that the separation between classes is maximized while the variation within each class is minimized.
# Positive coefficients indicate that higher values of a variable contribute to a higher score on that discriminant; negative coefficients suggest the opposite. The larger the magnitude of the coefficient, the greater the impact of that variable on the discriminant.

library(MASS)

# Fit LDA on the train set - LDA is a statistical method used for classification. It fits an LDA model using the train_data, where quantiles_crime is the target variable to predict based on the other variables
lda_model <- lda(quantiles_crime ~ ., data = train_data)

# Predict on the test set by using the function predict to apply the model created from the train_data to test_data -> the idea here is to predict the class labels for this test_data
lda_pred <- predict(lda_model, test_data)$class # class extract the predicted class labels

# actual crime quantiles
actual_crime_categories <- test_data$quantiles_crime

# Extract LD scores from the LDA model's predictions
lda_scores <- predict(lda_model, test_data)$x # x accesses the matrix of posterior probabilities or scores associated with each class for the observations in the test_data.

# Create a dataframe with LD scores from first 2 columns of the lda_score and the predicted classes. Combining LD1, LD2, and the predicted classes for visualization. In many cases, visualizing beyond LD1 and LD2 might become complex to display effectively in two-dimensional plots. LD1 and LD2 are typically chosen for visualization as they capture the most discrimination power while allowing for a clearer visualization on a 2D plot.
plot_data <- data.frame(LD1 = lda_scores[, 1], LD2 = lda_scores[, 2], prediction_crime_quantile = as.factor(lda_pred))

plot_data
##            LD1         LD2 prediction_crime_quantile
## 1   -3.6981646 -0.21588536           premier_quantil
## 11  -1.8549894 -0.64235666            second_quantil
## 13  -2.3093229  0.08193752            second_quantil
## 16  -2.3252314 -0.31652410            second_quantil
## 21  -1.6342913 -0.76131683            second_quantil
## 30  -1.9693603 -0.90333329             third_quantil
## 31  -1.6704217 -0.78220979            second_quantil
## 37  -2.0363588  0.04951612            second_quantil
## 39  -2.3337243  0.24743348            second_quantil
## 53  -3.2896756  0.95744486           premier_quantil
## 61  -0.9979333  0.94196457            second_quantil
## 66  -3.2474880  3.21818139           premier_quantil
## 68  -3.2552531  1.07462253            second_quantil
## 77  -2.2219250 -0.04031855            second_quantil
## 87  -3.1225147  0.17517948            second_quantil
## 89  -3.2378959 -0.59639064            second_quantil
## 91  -3.2968089 -0.25352905            second_quantil
## 92  -3.2449264 -0.31929177            second_quantil
## 105 -1.6046883 -0.44975103            second_quantil
## 110 -1.5349080 -0.52304057            second_quantil
## 112 -1.5195115 -0.64414509             third_quantil
## 118 -1.4967890 -0.45005177            second_quantil
## 119 -1.3544853 -0.44841457            second_quantil
## 124 -2.0176217 -1.74719206             third_quantil
## 125 -2.2070971 -1.64017150             third_quantil
## 129 -1.4523991 -1.60548475             third_quantil
## 134 -1.4237891 -1.57518774             third_quantil
## 137 -1.4059599 -1.51141894             third_quantil
## 138 -1.4954072 -1.55873000             third_quantil
## 140 -1.3466158 -1.59380356             third_quantil
## 156 -0.7346977 -3.37308048             third_quantil
## 160 -0.6718932 -3.18898528             third_quantil
## 162 -1.3778419 -2.37158560             third_quantil
## 172 -1.4785048 -1.35654558             third_quantil
## 174 -1.9962113 -0.36433577            second_quantil
## 186 -2.5901737 -0.50733454            second_quantil
## 188 -2.5305333  1.58997730           premier_quantil
## 193 -2.8098884  1.33791225           premier_quantil
## 194 -4.3970014  2.11278193           premier_quantil
## 201 -3.4622306  3.08281738           premier_quantil
## 204 -2.6170319  2.41200852           premier_quantil
## 205 -2.6414814  2.36819724           premier_quantil
## 211 -2.3140662 -1.24044828            second_quantil
## 214 -2.7162027 -0.11001481            second_quantil
## 215 -2.3689587  0.09400501            second_quantil
## 218 -1.7323103 -1.21114146             third_quantil
## 228 -0.8854793 -0.44542675             third_quantil
## 241 -2.1901619  1.06814685           premier_quantil
## 243 -2.1389418  1.05757895            second_quantil
## 244 -2.7232748  1.70809598           premier_quantil
## 247 -1.8686151  1.01695278            second_quantil
## 251 -2.2085227  1.39465062           premier_quantil
## 261 -1.4927185 -1.07797310             third_quantil
## 262 -1.3102810 -1.52403537             third_quantil
## 264 -1.3895743 -1.17887954             third_quantil
## 266 -1.6882534 -0.22011015             third_quantil
## 283 -2.7350460 -0.54098416            second_quantil
## 287 -3.9499116  2.56895190           premier_quantil
## 292 -2.6525536  2.56893476           premier_quantil
## 296 -3.2342182 -0.08973777            second_quantil
## 306 -1.2898497  1.18528405           premier_quantil
## 311 -2.3639719  0.21012922            second_quantil
## 314 -2.1865934 -0.74299419            second_quantil
## 317 -1.9814437 -0.85310588            second_quantil
## 328 -2.2402841 -0.07105517            second_quantil
## 332 -4.1362653  1.27502515           premier_quantil
## 337 -2.1865592  0.01719251            second_quantil
## 341 -2.0898174 -0.06634654            second_quantil
## 343 -4.0671276 -0.57514751            second_quantil
## 344 -2.2082679  1.48784989           premier_quantil
## 350 -4.1116672  0.93556172           premier_quantil
## 354 -2.5554844  2.40411964           premier_quantil
## 355 -2.8506045  2.91735230           premier_quantil
## 359  5.7819841 -0.99339607            fourth_quantil
## 360  6.0943657 -0.48113829            fourth_quantil
## 371  5.7952693 -1.05182593            fourth_quantil
## 373  6.2162505 -1.15799680            fourth_quantil
## 379  6.4117941  0.46092310            fourth_quantil
## 383  6.2814224  0.17927658            fourth_quantil
## 384  6.3028241  0.11435689            fourth_quantil
## 392  6.0873817 -0.13169591            fourth_quantil
## 406  7.4276821  1.24639183            fourth_quantil
## 408  6.3471246  0.07786106            fourth_quantil
## 410  6.3877914  0.13568874            fourth_quantil
## 427  5.6983160  1.43541718            fourth_quantil
## 432  5.9955283  0.62587776            fourth_quantil
## 446  6.6916877 -0.34290394            fourth_quantil
## 450  6.1794826 -0.10222327            fourth_quantil
## 453  5.9912065 -0.09371909            fourth_quantil
## 454  6.0353461 -0.34815168            fourth_quantil
## 458  6.3435527  0.01689029            fourth_quantil
## 459  6.0315389  0.01170339            fourth_quantil
## 462  5.8714359 -0.10535634            fourth_quantil
## 471  5.4828233  0.68175899            fourth_quantil
## 473  5.3706256  0.68524014            fourth_quantil
## 480  5.8574602  0.65731836            fourth_quantil
## 481  5.0636158  1.20428580            fourth_quantil
## 483  5.0398335  0.94320653            fourth_quantil
## 499 -1.3238090 -0.53194343             third_quantil
## 500 -1.2091825 -0.49009274             third_quantil
## 503 -3.0093031 -1.01166402            second_quantil
# Create a scatterplot of LD1 and LD2
plot_LDA <- ggplot(plot_data, aes(x = LD1, y = LD2, color = prediction_crime_quantile)) +
  geom_point() +
  labs(title = "LDA Biplot with Predicted Crime Quantiles")

plot_LDA

# adding real values - comparison of actual vs predicted values in test_data
realVsPred_plot <- plot_LDA + 
  geom_point(aes(color = actual_crime_categories), size = 4, alpha = 0.1) +
  labs(color = "Real Quantiles of Crime")

realVsPred_plot

# the accuracy of predictions using test data
accuracy <- mean(lda_pred == test_data$quantiles_crime)
print(paste("Accuracy of LDA model on test data:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy of LDA model on test data: 69.31 %"

Question 6:

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)

# save the crime data: 
actual_crime_categories <- test_data$quantiles_crime

# Remove the categorical crime variable from the test dataset
test_data_without_crime <- subset(test_data, select = -c(quantiles_crime))

# predict the classes with the model - this was already done earlier for lda_pred, so these steps repeat.
lda_pred_test <- predict(lda_model, test_data_without_crime)$class

# cross-tabulate predictions vs actual categories - the results are identical both ways, since the same steps were performed twice. I might have missed something in the instructions.
cross_tab <- table(Predicted = lda_pred_test, Actual = actual_crime_categories)
cross_tab
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              16              4             0              0
##   second_quantil               11             15             9              0
##   third_quantil                 1              5            15              0
##   fourth_quantil                0              0             1             24
cross_table <- table(Predicted = lda_pred, Actual = actual_crime_categories)
cross_table
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              16              4             0              0
##   second_quantil               11             15             9              0
##   third_quantil                 1              5            15              0
##   fourth_quantil                0              0             1             24
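The overall accuracy can be recovered from the cross table: correct predictions sit on the diagonal. Using the counts printed above (a small arithmetic check of mine):

```r
# counts from the cross table above; columns are the actual classes
ct <- matrix(c(16, 11, 1, 0,    # actual premier_quantil
               4, 15, 5, 0,     # actual second_quantil
               0, 9, 15, 1,     # actual third_quantil
               0, 0, 0, 24),    # actual fourth_quantil
             nrow = 4)

accuracy <- sum(diag(ct)) / sum(ct)
round(accuracy * 100, 2)  # 69.31 - matches the accuracy computed earlier
```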

Question 7:

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

# Reload the Boston dataset
Boston_load <- Boston

# Scale the Boston data
Boston_scaled <- scale(Boston_load)

# Calculate the distances between the scaled data points:
distances <- dist(Boston_scaled)

# Visualize the distances using fviz_dist()
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_dist(distances, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# k-means: the k-means algorithm groups observations into clusters based on their similarity, aiming to minimize the variation within clusters while maximizing the variation between clusters. We need to choose the number of clusters before running the k-means analysis.

# Elbow Method to find the optimal number of clusters: calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off. In the plot of WSS versus k, this is visible as an elbow.
# The Squared Error for each point is the square of the distance of the point from its representation i.e. its predicted cluster center.
# The WSS score is the sum of these Squared Errors for all the points.
wss <- numeric(10)  # Initialize within-cluster sum of squares vector
for (i in 1:10) {
  kmeans_model <- kmeans(Boston_scaled, centers = i, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss  # Store within-cluster sum of squares
}
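As a check on the WSS definition given above (my own sketch), the tot.withinss returned by kmeans() equals the sum of squared distances from each point to its assigned cluster centre:

```r
library(MASS)  # for the Boston dataset

set.seed(1)
X <- scale(Boston)
km <- kmeans(X, centers = 3, nstart = 10)

# centre assigned to each observation, as a matrix aligned with X
assigned_centres <- km$centers[km$cluster, ]

wss_manual <- sum((X - assigned_centres)^2)
all.equal(wss_manual, km$tot.withinss)  # TRUE
```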

# Plotting the Elbow Method
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares", main = "Elbow Method")

# or we can use the direct function for the Elbow method
kmeans_model_optimal2 <- fviz_nbclust(Boston_scaled, kmeans, method = "wss")
# or we can use the silhouette method
fviz_nbclust(Boston_scaled, kmeans, method = "silhouette")

# Visualize clusters using pairs() - note that at this point kmeans_model is still the last model fitted in the loop above (k = 10)
pairs(Boston_scaled, col = kmeans_model$cluster)

k <- 2  # 2 seems to be the best option according to the Elbow Method and the silhouette method 

# K-means model with the chosen k:
kmeans_model <- kmeans(Boston_scaled, centers = k, nstart = 25)

cluster_assignments <- kmeans_model$cluster

# visualize the clusters thanks to fviz_cluster function
fviz_cluster(kmeans_model, data = Boston_scaled)

library(GGally)

# Combine the scaled data with the cluster assignments (cluster 1 or 2)
clustered_data <- cbind(as.data.frame(Boston_scaled), Cluster = as.factor(cluster_assignments))

# Visualize clusters using ggpairs()
ggpairs(clustered_data, aes(color = Cluster))
## Warning in cor(x, y): the standard deviation is zero

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Mean for each cluster and variable:
clustered_data %>%
  mutate(Cluster = kmeans_model$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 2 × 15
##   Cluster   crim     zn  indus     chas    nox     rm    age    dis    rad
##     <int>  <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1       1 -0.389  0.262 -0.615  0.00291 -0.582  0.245 -0.433  0.454 -0.583
## 2       2  0.724 -0.487  1.14  -0.00541  1.08  -0.455  0.805 -0.844  1.08 
## # ℹ 5 more variables: tax <dbl>, ptratio <dbl>, black <dbl>, lstat <dbl>,
## #   medv <dbl>